Add proposal for per-tenant cardinality API #7335

CharlieTLe wants to merge 6 commits into cortexproject:master from
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Charlie Le <charlie_le@apple.com>
> Currently, Cortex tenants lack visibility into which metrics, labels, and label-value pairs contribute the most series in ingesters. Without this information, debugging high-cardinality issues requires operators to inspect TSDB internals directly on ingester instances, which is impractical in a multi-tenant, distributed environment.
>
> Prometheus itself exposes a `/api/v1/status/tsdb` endpoint that provides cardinality statistics from the TSDB head. This proposal brings equivalent functionality to Cortex as a multi-tenant, distributed API.
I am not a fan of the TSDB status API name... the Prometheus API might change and add more stuff. A dedicated `api/v1/cardinality` might be better?

I agree. We might not use the Prometheus TSDB in the future.
> ## Out of Scope
>
> - **Long-term storage cardinality analysis**: This endpoint only covers in-memory TSDB head data in ingesters. Analyzing cardinality across compacted blocks in object storage is a separate concern. A future long-term cardinality API could reuse portable fields (see [Extensibility](#extensibility-to-long-term-storage)) or introduce a separate endpoint.

Do we plan to have a different API for long-term storage cardinality? We should aim for the same API endpoint even though we don't have to design for it now.

I agree we should plan for this too, probably sooner than later.
> Expose per-tenant TSDB head cardinality statistics via a REST API endpoint on the Cortex query path. The endpoint should:
>
> 1. Be compatible with the Prometheus `/api/v1/status/tsdb` response format.

I am not sure this needs to be part of the goals. Does it need to be compatible? I think our API response format is already incompatible today.
> - **Authentication**: Requires `X-Scope-OrgID` header (standard Cortex tenant authentication).
> - **Query Parameter**: `limit` (optional, default 10) - controls the number of top items returned per category.

I agree, we need start and end. Sometimes cardinality issues are specific in time.
```protobuf
message TSDBStatusResponse {
  uint64 num_series = 1;
  int64 min_time = 2;
  int64 max_time = 3;
```

Do we need min/max? How do we aggregate these in the final response? min(min_t) and max(max_t)?
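The aggregation the comment suggests can be sketched as follows; `tsdbStatus` is an illustrative Go stand-in for the per-ingester proto responses, not an actual Cortex type:

```go
package main

import "fmt"

// tsdbStatus mirrors the min/max fields of the proposed
// TSDBStatusResponse message (illustrative only).
type tsdbStatus struct {
	MinTime int64 // milliseconds since epoch
	MaxTime int64
}

// mergeTimeRange aggregates per-ingester head time ranges the way the
// comment suggests: min(min_t) and max(max_t) across all responses.
func mergeTimeRange(responses []tsdbStatus) (minT, maxT int64) {
	if len(responses) == 0 {
		return 0, 0
	}
	minT, maxT = responses[0].MinTime, responses[0].MaxTime
	for _, r := range responses[1:] {
		if r.MinTime < minT {
			minT = r.MinTime
		}
		if r.MaxTime > maxT {
			maxT = r.MaxTime
		}
	}
	return minT, maxT
}

func main() {
	minT, maxT := mergeTimeRange([]tsdbStatus{
		{MinTime: 100, MaxTime: 900},
		{MinTime: 50, MaxTime: 800},
	})
	fmt.Println(minT, maxT) // 50 900
}
```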
> 2. **`chunkCount` omitted**: Prometheus includes a `chunkCount` field (from `prometheus_tsdb_head_chunks`). In a distributed system with replication, chunk counts across ingesters cannot be meaningfully aggregated — chunks are an ingester-local storage detail, and summing/dividing by the replication factor does not produce a useful number.
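By contrast, series counts can be approximated across replicas. A later commit in this PR documents replication-factor division on the head path as a best-effort approximation; a minimal sketch (names are illustrative, not Cortex code):

```go
package main

import "fmt"

// approxSeriesCount sketches the head-path aggregation: each series is
// replicated to up to RF ingesters, so summed per-ingester counts are
// divided by the replication factor. This is a best-effort
// approximation, exact only when all replicas are healthy and fully
// caught up.
func approxSeriesCount(perIngester []uint64, replicationFactor uint64) uint64 {
	var sum uint64
	for _, n := range perIngester {
		sum += n
	}
	return sum / replicationFactor
}

func main() {
	// Three ingesters at RF=3, each holding the same 1000 series.
	fmt.Println(approxSeriesCount([]uint64{1000, 1000, 1000}, 3)) // 1000
}
```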
> **Open question**: Should we adopt the `headStats` wrapper to maintain client compatibility with Prometheus tooling? The trade-off is compatibility vs simplicity — the flat format is easier to consume for Cortex-specific clients, but adopting the Prometheus format would allow reuse of existing client libraries.

Does any Prometheus tool consume this today? Why is compatibility a concern?
> | Field | Head-specific | Rationale |
> | --- | --- | --- |
> | `labelValueCountByLabelName` | No | Portable to block storage |
> | `seriesCountByLabelValuePair` | No | Portable to block storage |
> | `memoryInBytesByLabelName` | **Yes** | In-memory byte usage has no analogue in object storage |
> | `minTime` / `maxTime` | **Yes** | Reflects head time range, not total storage |

Do we need to add those head-specific fields?
…ore gateways

Add source=blocks query parameter to analyze cardinality from compacted blocks in object storage. The blocks path fans out to store gateways, which compute statistics from block index headers (cheap label value counts) and posting list expansion (exact series counts per metric). Results are cached per immutable block.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…plify

Address feedback from PR cortexproject#7335 review:

- Rename endpoint from /api/v1/status/tsdb to /api/v1/cardinality
- Drop Prometheus compatibility as a goal
- Add start/end time range query parameters
- Drop head-specific fields (numLabelPairs, memoryInBytesByLabelName, minTime, maxTime) to unify response across both sources
- Remove API Compatibility and Field Portability sections

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
…limit

Make start/end required for source=blocks to prevent unbounded block scanning. Add cardinality_max_query_range per-tenant limit (default 24h) to give operators control over the blast radius.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Critical:
- Fix blocks path aggregation: no SG RF division since GetClientsFor
routes each block to exactly one store gateway
Significant:
- Add min_time, max_time, block_ids to store gateway CardinalityRequest
- Specify MaxErrors=0 for head path with availability implications
- Add consistency check and retry logic for blocks path
- Document RF division as best-effort approximation
Moderate:
- Wrap responses in standard {status, data} Prometheus envelope
- Change HTTP 422 to HTTP 400 for limit violations
- Add Error Responses section with all validation scenarios
- Add approximated field for block overlap and partial results
- Add Observability section with metrics
- Add per-tenant concurrency limit and query timeout
- Reject start/end for source=head instead of silently ignoring
Low:
- Add Rollout Plan with phased approach and feature flag
- Document rolling upgrade compatibility (Unimplemented handling)
- Document Query Frontend bypass
- Improve caching: full results keyed by ULID, limit at response time
- Add missing files to implementation section
- Move shared proto to pkg/cortexpb/cardinality.proto
- Rename TSDBStatus* to Cardinality* throughout
- Add limit upper bound (max 512)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Le <charlie_le@apple.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Le <charlie_le@apple.com>
Summary
Proposal for a per-tenant cardinality API (`GET /api/v1/cardinality`) that exposes cardinality statistics (top metrics by series count, top labels by value count, top label-value pairs by series count) across two data sources:

- `source=head`: Fans out to ingesters via the distributor, aggregates TSDB head stats with RF-based deduplication.
- `source=blocks`: Fans out to store gateways via `BlocksFinder` + `GetClientsFor`, computes cardinality from block indexes with per-block caching.

Key design points:

- `start`/`end` required for the blocks path, rejected for the head path (head cannot sub-filter)
- Per-tenant limits: `cardinality_api_enabled`, `cardinality_max_query_range`, `cardinality_max_concurrent_requests`, `cardinality_query_timeout`
- `{status, data}` Prometheus response envelope with an `approximated` field for block overlap / partial results

🤖 Generated with Claude Code